Abstract

While neural networks have been successfully
applied to many NLP tasks, the resulting
vector-based models are very difficult to
interpret. For example, it’s not clear how they
achieve compositionality, building sentence
meaning from the meanings of words and phrases.
In this paper we describe strategies for
visualizing compositionality in neural models for
NLP, inspired by similar work in computer vision.
We first plot unit values to visualize
compositionality of negation, intensification, and
concessive clauses, allowing us to see well-known
markedness asymmetries in negation. We then
introduce methods for visualizing a unit’s
salience, the amount that it contributes to the
final composed meaning, from first-order
derivatives. Our general-purpose methods may have
wide applications for understanding
compositionality and other semantic properties of
deep networks.

1 Introduction 

Neural models match or outperform the performance
of other state-of-the-art systems on a variety of
NLP tasks. Yet unlike traditional feature-based
classifiers that assign and optimize weights for
varieties of human-interpretable features
(parts-of-speech, named entities, word shapes,
syntactic parse features, etc.), the behavior of
deep learning models is much less easily
interpreted. Deep learning models mainly operate
on word embeddings (low-dimensional, continuous,
real-valued vectors) through multi-layer neural
architectures, each layer of which is
characterized as an array of hidden neuron units.
It is unclear how deep learning models deal with
composition, implementing functions like negation
or intensification, or combining meaning from
different parts of the sentence, filtering the
informational chaff from the wheat, to build
sentence meaning.

In this paper, we explore multiple strategies to
interpret meaning composition in neural models.
We employ traditional methods like representation
plotting, and introduce simple strategies for
measuring how much a neural unit contributes to
meaning composition, that is, its ‘salience’ or
importance, using first derivatives.

The visualization techniques presented in this
work shed light on how neural models work. For
example, we illustrate that the LSTM’s success is
due to its ability to maintain a much sharper
focus on the important key words than other
models; that composition across multiple clauses
works competitively; that the models are able to
capture negative asymmetry, an important property
of semantic compositionality in natural language
understanding; and that there is sharp
dimensional locality, with certain dimensions
marking negation and quantification in a
surprisingly localist manner. Though our attempts
only touch superficial points in neural models,
and each method has its pros and cons, together
they may offer some insights into the behavior of
neural models in language-based tasks, marking
one initial step toward understanding how they
achieve meaning composition in natural language
processing.

The next section describes some visualization
models in vision and NLP that have inspired this
work. We describe the datasets and the adopted
neural models in Section 3. Different
visualization strategies and the corresponding
analytical results are presented separately in
Sections 4, 5, and 6, followed by a brief
conclusion.

2 A Brief Review of Neural Visualization 

Similarity is commonly visualized graphically,
generally by projecting the embedding space into
two dimensions and observing that similar words
tend to be clustered together (e.g., Elman (1989),
Ji and Eisenstein (2014), Faruqui and Dyer
(2014)). Karpathy et al. (2015) attempt to
interpret recurrent neural models from a
statistical point of view and do not deeply
address the compositionality of meaning. Other
relevant attempts include (Fyshe et al., 2015;
Faruqui et al., 2015).

Methods for interpreting and visualizing neural
models have been much more extensively explored
in vision, especially for Convolutional Neural
Networks (CNNs or ConvNets) (Krizhevsky et al.,
2012), multi-layer neural networks in which the
original matrix of image pixels is convolved and
pooled as it is passed on to hidden layers.
ConvNet visualization techniques consist mainly
of mapping the different layers of the network
(or other features like SIFT (Lowe, 2004) and HOG
(Dalal and Triggs, 2005)) back to the initial
image input, thus capturing the
human-interpretable information they represent in
the input, and how units in these layers
contribute to any final decisions (Simonyan et
al., 2013; Mahendran and Vedaldi, 2014; Nguyen et
al., 2014; Szegedy et al., 2013; Girshick et al.,
2014; Zeiler and Fergus, 2014). Such methods
include:

(1) Inversion: Inverting the representations by
training an additional model to project outputs
from different neural levels back to the initial
input images (Mahendran and Vedaldi, 2014;
Vondrick et al., 2013; Weinzaepfel et al., 2011).
The intuition behind reconstruction is that the
pixels that are reconstructable from the current
representations are the content of the
representation. The inverting algorithms allow
the current representation to align with
corresponding parts of the original images.

(2) Back-propagation (Erhan et al., 2009;
Simonyan et al., 2013) and Deconvolutional
Networks (Zeiler and Fergus, 2014): Errors are
back-propagated from output layers to each
intermediate layer and finally to the original
image inputs. Deconvolutional Networks work in a
similar way by projecting outputs back to initial
inputs layer by layer, each layer associated with
one supervised model for projecting upper layers
to lower ones. These strategies make it possible
to spot active regions or ones that contribute
the most to the final classification decision.

(3) Generation: This group of work generates
images in a specific class from a sketch guided
by already trained neural models (Szegedy et al.,
2013; Nguyen et al., 2014). Models begin with an
image whose pixels are randomly initialized and
mutated at each step. The specific layers that
are activated at different stages of image
construction can help in interpretation.

While the above strategies inspire the work we
present in this paper, there are fundamental
differences between vision and NLP. In NLP, words
function as basic units, and hence (word) vectors
rather than single pixels are the basic units.
Sequences of words (e.g., phrases and sentences)
are also presented in a more structured way than
arrangements of pixels. In parallel to our
research, independent work (Karpathy et al., 2015)
has explored a similar direction from an
error-analysis point of view, by analyzing
predictions and errors from a recurrent neural
model. Other distantly relevant work includes
Murphy et al. (2012) and Fyshe et al. (2015), who
used a manual task to quantify the
interpretability of semantic dimensions by
presenting human users with a list of words and
asking them to choose the one that does not
belong to the list. A similar strategy is adopted
in (Faruqui et al., 2015) by extracting the
top-ranked words in each vector dimension.

3 Datasets and Neural Models 

We explored two datasets on which neural models 
are trained, one of which is of relatively small scale 
and the other of large scale. 

3.1 Stanford Sentiment Treebank 

The Stanford Sentiment Treebank is a benchmark
dataset widely used for neural model evaluations.
The dataset contains gold-standard sentiment
labels for every parse tree constituent, from
sentences to phrases to individual words, for
215,154 phrases in 11,855 sentences. The task is
to perform both fine-grained (very positive,
positive, neutral, negative and very negative)
and coarse-grained (positive vs. negative)
classification at both the phrase and sentence
level. For more details about the dataset, please
refer to Socher et al. (2013).

While many studies on this dataset use recursive
parse-tree models, in this work we employ only
standard sequence models (RNNs and LSTMs) since
these are the most widely used current neural
models, and sequential visualization is more
straightforward. We therefore first transform
each parse tree node to a sequence of tokens. The
sequence is first mapped to a phrase/sentence
representation and fed into a softmax classifier.
Phrase/sentence representations are built with
the following three models: standard recurrent
sequence models with tanh activation functions,
LSTMs and Bidirectional LSTMs. For details about
the three models, please refer to the Appendix.

Training AdaGrad with mini-batches was used for
training, with parameters (L2 penalty, learning
rate, mini-batch size) tuned on the development
set. The number of iterations is treated as a
variable to tune, and parameters are harvested
based on the best performance on the dev set. The
number of dimensions for the word and hidden
layers is set to 60, with a dropout rate of 0.1.
The standard recurrent model achieves 0.429
(fine-grained) and 0.850 (coarse-grained)
accuracy at the sentence level; the LSTM achieves
0.469 and 0.870, and the Bidirectional LSTM 0.488
and 0.878, respectively.
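
The AdaGrad update mentioned above can be sketched as follows. This is only a minimal illustration on a toy objective, not the authors' training code: the function name `adagrad_step`, the default hyperparameter values and the `eps` stabilizer are assumptions, and mini-batching and dropout are left out.

```python
import math

def adagrad_step(params, grads, accum, learning_rate=0.1, l2_penalty=0.0, eps=1e-8):
    """Apply one AdaGrad update in place and return the updated parameters."""
    for i, g in enumerate(grads):
        g = g + l2_penalty * params[i]   # add the gradient of the L2 penalty term
        accum[i] += g * g                # running sum of squared gradients
        params[i] -= learning_rate * g / (math.sqrt(accum[i]) + eps)
    return params

# Toy usage: minimise f(x, y) = x^2 + y^2, whose gradient is (2x, 2y).
params = [1.0, -2.0]
accum = [0.0, 0.0]
for _ in range(500):
    grads = [2.0 * p for p in params]
    adagrad_step(params, grads, accum, learning_rate=0.5)
```

Each parameter gets its own effective step size, which shrinks as its squared gradients accumulate.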

3.2 Sequence-to-Sequence Models 

Seq2Seq models are neural models that aim to
generate a sequence of output text given an
input. In principle, Seq2Seq models can be adapted
to any NLP task that can be formalized as
predicting outputs given inputs, and serve
different purposes depending on the inputs and
outputs, e.g., machine translation, where inputs
correspond to source sentences and outputs to
target sentences (Sutskever et al., 2014; Luong
et al., 2014), or conversational response
generation, where inputs correspond to messages
and outputs correspond to responses (Vinyals and
Le, 2015; Li et al., 2015). Seq2Seq models need
to be trained on massive amounts of data for the
implicit semantic and syntactic relations between
pairs to be learned.

Seq2Seq models map an input sequence to a vector
representation using LSTM models and then
sequentially predict tokens based on the
pre-obtained representation. The model defines a
distribution over outputs (Y) and sequentially
predicts tokens given inputs (X) using a softmax
function.
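
One way to write the softmax factorization described in this paragraph is sketched below. The notation is an assumption made for illustration: $e_{y}$ stands for the embedding of a candidate output token $y$, $N_X$ and $N_Y$ for the input and output lengths, and $f$ for the scoring function between the previous LSTM state and a token embedding.

```latex
p(Y \mid X)
  = \prod_{t=1}^{N_Y} p(y_t \mid x_1, \dots, x_{N_X}, y_1, \dots, y_{t-1})
  = \prod_{t=1}^{N_Y}
    \frac{\exp\big(f(h_{t-1}, e_{y_t})\big)}
         {\sum_{y'} \exp\big(f(h_{t-1}, e_{y'})\big)}
```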
where $f(h_{t-1}, e_{y_t})$ denotes the activation
function between $h_{t-1}$ and $e_{y_t}$ (the
embedding of token $y_t$), and $h_{t-1}$ is the
representation output from the LSTM at time
$t-1$. For each time step in word prediction,
Seq2Seq models combine the current token with
previously built embeddings for next-step word
prediction.

For ease of visualization, we turn to the most
straightforward task, the autoencoder, where
inputs and outputs are identical. The goal of an
autoencoder is to reconstruct inputs from the
pre-obtained representation. We would like to see
how individual input tokens affect the overall
sentence representation and each of the tokens to
predict in the outputs. We trained the
autoencoder on a subset of the WMT’14 corpus
containing 4 million English sentences with an
average length of 22.5 words. We followed the
training protocols described in (Sutskever et
al., 2014).

4 Representation Plotting 

We begin with simple plots of representations to
shed light on local compositions using the
Stanford Sentiment Treebank.

Local Composition Figure 1 shows a 60d heatmap
vector for the representation of selected
words/phrases/sentences, with an emphasis on
extent modifications (adverbial and adjectival)
and negation. Embeddings for phrases or sentences
are attained by composing word representations
from the pretrained model.

The intensification part of Figure 1 shows
suggestive patterns where values for a few
dimensions are strengthened by modifiers like “a
lot” (the red bar in the first example), “so
much” (the red bar in the second example), and
“incredibly”. Though the patterns for negation
are not as clear, there is still a consistent
reversal for some dimensions, visible as a shift
between blue and red for dimensions boxed on the
left.

We then visualize words and phrases using t-SNE
(Van der Maaten and Hinton, 2008) in Figure 2,
deliberately adding in some random words for
comparative purposes. As can be seen, neural
models nicely learn the properties of local
compositionality, clustering negation+positive
words (“not nice”, “not good”) together with
negative words. Note also the asymmetry of
negation: “not bad” is clustered more with the
negative than the positive words (as shown in
both Figures 1 and 2). This asymmetry has been
widely discussed in linguistics, for example as
arising from markedness, since ‘good’ is the
unmarked direction of the scale (Clark and Clark,
1977; Horn, 1989; Fraenkel and Schul, 2008). This
suggests that although the model does seem to
focus on certain units for negation in Figure 1,
the neural model is not just learning to apply a
fixed transform for ‘not’ but is able to capture
the subtle differences in the composition of
different words.
Figure 2: t-SNE visualization of latent representations for modifications and negations.
Figure 4: t-SNE visualization for clause composition.


Concessive Sentences In concessive sentences, two
clauses have opposite polarities, usually related
by a contrary-to-expectation implicature. We plot
evolving representations over time for two
concessives in Figure 3. The plots suggest:

1. For tasks like sentiment analysis whose goal
is to predict a specific semantic dimension (as
opposed to general tasks like language model word
prediction), too large a dimensionality leaves
many dimensions non-functional (with values close
to 0), causing two sentences of opposite
sentiment to differ only in a few dimensions.
This may explain why more dimensions don’t
necessarily lead to better performance on such
tasks (for example, as reported in (Socher et
al., 2013), optimal performance is achieved when
word dimensionality is set to between 25 and 35).

2. Both sentences contain two clauses connected
by the conjunction “though”. Such two-clause
sentences might work either collaboratively
(models would remember the word “though” and make
the second clause share the same sentiment
orientation as the first) or competitively, with
the stronger one dominating. The region within
the dotted line in Figure 3(a) favors the second
assumption: the difference between the two
sentences is diluted when the final words
(“interesting” and “boring”) appear.

Clause Composition In Figure 4 we explore this
clause composition in more detail.
Representations move closer to the negative
sentiment region when negative clauses like
“although it had bad acting” or “but it is too
long” are added to the end of a simply positive “I
like the movie”. By contrast, adding a concessive
clause to a negative clause does not move the
representation toward the positive; “I hate X
but...” is still very negative, not that
different from “I hate X”. This difference again
suggests the model is able to capture negative
asymmetry (Clark and Clark, 1977; Horn, 1989;
Fraenkel and Schul, 2008).
Figure 5: Saliency heatmap for “I hate the movie.” Each row corresponds to saliency scores for the corresponding word
representation, with each grid cell representing one dimension.
Figure 6: Saliency heatmap for “I hate the movie I saw last night.”
Figure 7: Saliency heatmap for “I hate the movie though the plot is interesting.”


5 First-Derivative Saliency 

In this section, we describe another strategy,
which is inspired by the back-propagation
strategy in vision (Erhan et al., 2009; Simonyan
et al., 2013). It measures how much each input
unit contributes to the final decision, which can
be approximated by first derivatives.

More formally, for a classification model, an
input $E$ is associated with a gold-standard class
label $c$. (Depending on the NLP task, an input
could be the embedding for a word or a sequence
of words, while labels could be POS tags,
sentiment labels, the next word index to predict,
etc.) Given embeddings $E$ for input words with
the associated gold class label $c$, the trained
model associates the pair $(E, c)$ with a score
$S_c(E)$. The goal is to decide which units of $E$
make the most significant contribution to
$S_c(E)$, and thus to the decision, the choice of
class label $c$.

In the case of deep neural models, the class
score $S_c(e)$ is a highly non-linear function. We
approximate $S_c(e)$ with a linear function of $e$
by computing the first-order Taylor expansion

$S_c(e) \approx w(e)^T e + b$ (1)

where $w(e)$ is the derivative of $S_c$ with
respect to the embedding $e$:

$w(e) = \frac{\partial S_c}{\partial e}\Big|_e$ (2)

Figure 1: Visualizing intensification and negation. Each
vertical bar shows the value of one dimension in the final
sentence/phrase representation after compositions. Embeddings
for phrases or sentences are attained by composing word
representations from the pretrained model. [Heatmap panels:
Intensification (“I like it”, “I like it a lot”, “I hate it”,
“I hate it so much”, “the movie is good”, “the movie is
incredibly good”) and Negation (“good”, “not good”, “bad”,
“not bad”, “like”, “n't like”); color scale from -0.8 to 0.8.]

Figure 3: Representations over time from LSTMs. Each column
corresponds to outputs from the LSTM at each time step
(representations obtained after combining the current word
embedding with previously built embeddings). Each grid cell in
a column corresponds to one dimension of the current time-step
representation. The last rows correspond to absolute
differences for each time step between the two sequences.
[Horizontal axis: the tokens of “i hate the movie though the
plot is interesting”.]


The magnitude (absolute value) of the derivative
indicates the sensitivity of the final decision
to a change in one particular dimension, telling
us how much one specific dimension of the word
embedding contributes to the final decision. The
saliency score is given by

$S(e) = |w(e)|$ (3)
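
The saliency score can be illustrated on a toy model. The sketch below is not the paper's implementation: the diagonal recurrent layer, the weights `W_DIAG`, `V_DIAG` and `CLASS_WEIGHTS`, and the word vectors are all made-up assumptions, and the derivative is estimated with central finite differences rather than with back-propagation.

```python
import math

# Hypothetical toy parameters (not from the paper): a diagonal recurrent
# layer followed by a two-class softmax over a 2-dimensional hidden state.
W_DIAG = [0.5, 0.5]                         # recurrent weight per dimension
V_DIAG = [1.0, 1.0]                         # input weight per dimension
CLASS_WEIGHTS = [[1.0, 0.0], [-1.0, 0.0]]   # class 0 and class 1 read dim 0 only

def class_score(embeddings, c):
    """Toy stand-in for S_c(E): softmax probability of class c."""
    h = [0.0] * len(W_DIAG)
    for e in embeddings:
        h = [math.tanh(w * h_j + v * e_j)
             for w, v, h_j, e_j in zip(W_DIAG, V_DIAG, h, e)]
    logits = [sum(a * h_j for a, h_j in zip(row, h)) for row in CLASS_WEIGHTS]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    return exps[c] / sum(exps)

def saliency(embeddings, c, eps=1e-5):
    """Estimate |dS_c/de| for every dimension of every word embedding."""
    scores = []
    for i, e in enumerate(embeddings):
        row = []
        for j in range(len(e)):
            plus = [list(v) for v in embeddings]
            minus = [list(v) for v in embeddings]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad = (class_score(plus, c) - class_score(minus, c)) / (2 * eps)
            row.append(abs(grad))
        scores.append(row)
    return scores

sentence = [[0.1, 0.3], [-1.2, 0.4], [0.2, -0.1]]   # three made-up word vectors
scores = saliency(sentence, c=1)                     # one row per word
```

Each row of `scores` plays the role of one row in a saliency heatmap. Because the toy classifier ignores the second hidden dimension, that column of the map is zero.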

5.1 Results on Stanford Sentiment Treebank 

We first illustrate results on the Stanford
Sentiment Treebank. We plot in Figures 5, 6 and 7
the saliency scores (the absolute value of the
derivative of the loss function with respect to
each dimension of all word inputs) for three
sentences, applying the trained model to each
sentence. Each row corresponds to the saliency
scores for the corresponding word representation,
with each grid cell representing one dimension.
The examples are based on the clear sentiment
indicator “hate” that lends them all negative
sentiment.

“I hate the movie” All three models assign high
saliency to “hate” and dampen the influence of
other tokens. The LSTM offers a clearer focus on
“hate” than the standard recurrent model, but the
bidirectional LSTM shows the clearest focus,
attaching almost zero emphasis to words other
than “hate”. This is presumably due to the gate
structures in LSTMs and Bi-LSTMs that control
information flow, making these architectures
better at filtering out less relevant
information.
Figure 8: Variance visualization.


“I hate the movie that I saw last night” All
three models assign the correct sentiment. The
simple recurrent models again do poorly at
filtering out irrelevant information, assigning
too much salience to words unrelated to
sentiment. However, none of the models suffers
from the vanishing gradient problem despite this
sentence being longer; the salience of “hate”
still stands out after the 7-8 convolutional
operations that follow it.

“I hate the movie though the plot is interesting”
The simple recurrent model emphasizes only the
second clause “the plot is interesting”, assigning
no credit to the first clause “I hate the movie”.
This might seem to be caused by a vanishing
gradient, yet the model correctly classifies the
sentence as very negative, suggesting that it is
successfully incorporating information from the
first negative clause. We separately tested the
individual clause “though the plot is
interesting”. The standard recurrent model
confidently labels it as positive. Thus, despite
the lower saliency scores for words in the first
clause, the simple recurrent system manages to
rely on that clause and downplay the information
from the latter positive clause, even though the
later words have higher saliency scores. This
illustrates a limitation of saliency
visualization: first-order derivatives don’t
capture all the information we would like to
visualize, perhaps because they are only a rough
approximation of individual contributions and
might not suffice to deal with highly non-linear
cases. By contrast, the LSTM emphasizes the first
clause, sharply dampening the influence from the
second clause, while the Bi-LSTM focuses on both
“hate the movie” and “plot is interesting”.

5.2 Results on Sequence-to-Sequence 
Autoencoder 

Figure 9 shows the saliency heatmap for the
autoencoder in terms of predicting the
corresponding token at each time step. We compute
first derivatives for each preceding word through
back-propagation as decoding goes on. Each grid
cell corresponds to the magnitude of the average
saliency value for each 1000-dimensional word
vector. The heatmaps give a clear overview of the
behavior of neural models during decoding.
Observations can be summarized as follows:

1. For each time step of word prediction, Seq2Seq
models manage to link the word to predict back to
the corresponding region of the inputs
(automatically learning alignments); e.g., the
input region centered on the token “hate” exerts
more impact when the token “hate” is to be
predicted, and similarly for the tokens “movie”,
“plot” and “boring”.

2. Neural decoding combines the previously built
representation with the word predicted at the
current step. As decoding proceeds, the influence
of the initial input on decoding (i.e., tokens in
source sentences) gradually diminishes as more
previously predicted words are encoded in the
vector representations. Meanwhile, the influence
of the language model gradually dominates: when
the word “boring” is to be predicted, models
attach more weight to the earlier predicted
tokens “plot” and “is” but less to the
corresponding regions in the inputs, i.e., the
word “boring” in the inputs.

Figure 9: Saliency heatmap for the Seq2Seq autoencoder in
terms of predicting the corresponding token at each time step.

6 Average and Variance 

For settings where word embeddings are treated as
parameters to optimize from scratch (as opposed
to using pre-trained embeddings), we propose a
second, surprisingly easy and direct way to
visualize important indicators. We first compute
the average of the word embeddings for all the
words within the sentence. The measure of
salience or influence for a word is its deviation
from this average. The idea is that during
training, models learn to render indicators
different from non-indicator words, enabling them
to stand out even after many layers of
computation.

Figure 8 shows a map of variance; each grid cell
corresponds to the value of
$\| e_{i,j} - \frac{1}{N_s} \sum_{i' \in N_s} e_{i',j} \|^2$,
where $e_{i,j}$ denotes the value for the $j$-th
dimension of word $i$ and $N_s$ denotes the number
of tokens within the sentence.
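
The variance measure is simple enough to sketch directly. The word vectors below are made up for illustration, and the function name `variance_saliency` is an assumption rather than anything from the paper.

```python
def variance_saliency(embeddings):
    """Squared deviation of each embedding value from the sentence average
    of the same dimension: (e_ij - mean over i' of e_i'j) ** 2."""
    n = len(embeddings)
    dim = len(embeddings[0])
    means = [sum(e[j] for e in embeddings) / n for j in range(dim)]
    return [[(e[j] - means[j]) ** 2 for j in range(dim)] for e in embeddings]

# Toy usage with made-up 2-dimensional embeddings.
words = ["i", "hate", "the", "movie"]
vectors = [[0.1, 0.0], [0.9, -0.8], [0.0, 0.1], [0.2, 0.1]]
grid = variance_saliency(vectors)            # one row per word, one cell per dimension
per_word = [sum(row) for row in grid]        # total deviation of each word
most_salient = words[per_word.index(max(per_word))]
```

In this toy sentence the vector for “hate” lies furthest from the average, so it receives the largest score.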

As the figure shows, the variance-based salience
measure also does a good job of emphasizing the
relevant sentiment words. The model does have
shortcomings: (1) it can only be used in
scenarios where word embeddings are parameters to
learn; (2) it’s unclear how well the model is
able to visualize local compositionality.

7 Conclusion 

In this paper, we offer several methods to help
visualize and interpret neural models, to
understand how neural models are able to compose
meanings, demonstrating asymmetries of negation
and explaining some aspects of the strong
performance of LSTMs at these tasks.

Though our attempts only touch superficial points
in neural models, and each method has its pros
and cons, together they may offer some insights
into the behavior of neural models in
language-based tasks, marking one initial step
toward understanding how they achieve meaning
composition in natural language processing. Our
future work includes using the results of the
visualization to perform error analysis, and
understanding the strengths and limitations of
different neural models.
Appendix



Recurrent Models A recurrent network successively
takes word $w_t$ at step $t$, combines its vector
representation $e_t$ with the previously built
hidden vector $h_{t-1}$ from time $t-1$, calculates
the resulting current embedding $h_t$, and passes
it to the next step. The embedding $h_t$ for the
current time $t$ is thus:

$h_t = f(W \cdot h_{t-1} + V \cdot e_t)$ (4)

where $W$ and $V$ denote compositional matrices. If
$N_s$ denotes the length of the sequence, $h_{N_s}$
represents the whole sequence $S$. $h_{N_s}$ is
used as input to a softmax function for
classification tasks.
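
The recurrence and the softmax classifier can be sketched in a few lines. This is an illustrative sketch only: it assumes $f$ is tanh and that the initial hidden vector is zero, and the helper names and the toy matrices `W`, `V` and `U` are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def recurrent_encode(embeddings, W, V):
    """Apply h_t = tanh(W h_{t-1} + V e_t) over the sequence and return the
    final hidden vector, which represents the whole sequence."""
    h = [0.0] * len(W)   # assumption: the initial hidden vector is zero
    for e in embeddings:
        rec = matvec(W, h)
        inp = matvec(V, e)
        h = [math.tanh(r + i) for r, i in zip(rec, inp)]
    return h

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]

# Toy usage: 2-dimensional embeddings, 2 hidden units, 2 classes.
W = [[0.5, 0.0], [0.0, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, -1.0], [-1.0, 1.0]]               # hypothetical classifier weights
sequence = [[0.2, -0.1], [-0.7, 0.3], [0.1, 0.1]]
h_final = recurrent_encode(sequence, W, V)
probs = softmax(matvec(U, h_final))          # class distribution for the sequence
```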

Multi-layer Recurrent Models Multi-layer recurrent
models extend the one-layer recurrent structure by
operating on a deep neural architecture that
enables more expressivity and flexibility. The
model associates each time step for each layer
with a hidden representation $h_{l,t}$, where
$l \in [1, L]$ denotes the index of the layer and
$t$ denotes the index of the time step. $h_{l,t}$
is given by:

$h_{l,t} = f(W \cdot h_{l,t-1} + V \cdot h_{l-1,t})$ (5)

where $h_{0,t} = e_t$, which is the original word
embedding input at the current time step.

Bidirectional Models (Schuster and Paliwal, 1997)
add bidirectionality to the recurrent framework,
where embeddings for each time step are calculated
both forward and backward:

$h^{\rightarrow}_t = f(W^{\rightarrow} \cdot h^{\rightarrow}_{t-1} + V^{\rightarrow} \cdot e_t)$
$h^{\leftarrow}_t = f(W^{\leftarrow} \cdot h^{\leftarrow}_{t+1} + V^{\leftarrow} \cdot e_t)$

Normally, bidirectional models feed the
concatenation vector calculated from both
directions, $[h^{\leftarrow}_1, h^{\rightarrow}_{N_s}]$,
to the classifier. Bidirectional models can be
similarly extended to both multi-layer neural
models and LSTM versions.

Long-short Term Memory The LSTM model, first
proposed in (Hochreiter and Schmidhuber, 1997),
maps an input sequence to a fixed-sized vector by
sequentially convoluting the current
representation with the output representation of
the previous step. The LSTM associates each time
step with an input, control and memory gate, and
tries to minimize the impact of unrelated
information. $i_t$, $f_t$ and $o_t$ denote the gate
states at time $t$, $h_t$ denotes the hidden vector
output by the LSTM model at time $t$, and $e_t$
denotes the word embedding input at time $t$. We
have

$i_t = \sigma(W_i \cdot e_t + V_i \cdot h_{t-1})$
$f_t = \sigma(W_f \cdot e_t + V_f \cdot h_{t-1})$
$o_t = \sigma(W_o \cdot e_t + V_o \cdot h_{t-1})$
$l_t = \tanh(W_l \cdot e_t + V_l \cdot h_{t-1})$
$c_t = f_t \cdot c_{t-1} + i_t \times l_t$
$h_t = o_t \cdot m_t$

where $\sigma$ denotes the sigmoid function, the
values of $i_t$, $f_t$ and $o_t$ lie within the
range $[0, 1]$, and $\times$ denotes the pairwise
(element-wise) product.
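
A single LSTM step following these equations might look like the sketch below. It rests on two assumptions that the text does not confirm: all gate products are taken element-wise, and the $m_t$ in the last equation is read as the cell state $c_t$. The parameter names in the dictionary are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, V, e_t, h_prev):
    """Return W e_t + V h_prev for matrices stored as lists of rows."""
    return [
        sum(w * x for w, x in zip(w_row, e_t)) + sum(v * y for v, y in zip(v_row, h_prev))
        for w_row, v_row in zip(W, V)
    ]

def lstm_step(p, e_t, h_prev, c_prev):
    """One LSTM step; p maps hypothetical names such as "W_i" to matrices."""
    i_t = [sigmoid(z) for z in affine(p["W_i"], p["V_i"], e_t, h_prev)]
    f_t = [sigmoid(z) for z in affine(p["W_f"], p["V_f"], e_t, h_prev)]
    o_t = [sigmoid(z) for z in affine(p["W_o"], p["V_o"], e_t, h_prev)]
    l_t = [math.tanh(z) for z in affine(p["W_l"], p["V_l"], e_t, h_prev)]
    # c_t = f_t * c_{t-1} + i_t * l_t, element-wise
    c_t = [f * c + i * l for f, c, i, l in zip(f_t, c_prev, i_t, l_t)]
    # Assumption: the output gate is applied to the cell state c_t.
    h_t = [o * c for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy check with all-zero weights: every gate is sigmoid(0) = 0.5 and l_t = 0.
zeros = [[0.0]]
params = {name: zeros for name in
          ("W_i", "V_i", "W_f", "V_f", "W_o", "V_o", "W_l", "V_l")}
h_t, c_t = lstm_step(params, e_t=[1.0], h_prev=[0.0], c_prev=[2.0])
```

With zero weights the previous cell state is simply halved by the gate $f_t$, and the output is half of the new cell state.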

A multi-layer LSTM model works in the same way as
multi-layer recurrent models by enabling
multi-layer compositions.



